# CS-200 Computer Architecture

Part 4e. Instruction Level Parallelism Scheduling Examples

Paolo lenne <paolo.ienne@epfl.ch>

### **RISC-V Program**

| 1:  | lw   | x1,         | 0(x5) |            |
|-----|------|-------------|-------|------------|
| 2:  | addi | x5,         | x1,   | 1          |
| 3:  | lw   | <b>x1</b> , | 0(x6) |            |
| 4:  | add  | x3,         | x8,   | <b>x</b> 5 |
| 5:  | subi | x2,         | x6,   | 1          |
| 6:  | subi | x4,         | x3,   | 5          |
| 7:  | add  | x3,         | x2,   | <b>x</b> 4 |
| 8:  | 1w   | x2,         | 0(x7  | 7)         |
| 9:  | or   | x4,         | x2,   | <b>x</b> 1 |
| 10: | subi | x7,         | x3,   | 9          |

#### Goal

- Determine the execution schedule for this code in five different architectures
- Compare the number of cycles required by the five architectures
- Compute the CPI (and IPC) of all architectures on this program

- Multicycle processor, not pipelined
- Execution latencies:
  - ALU Operations → 4 cycles
  - − Memory Operations→ 6 cycles

```
1: lw
           x1, 0(x5)
2: addi
           x5, x1, 1
3: lw
            x1, 0(x6)
4: add
          x3, x8, x5
5: subi
          x2, x6, 1
6: subi
           x4, x3, 5
7: add
           x3, x2, x4
8: 1w
       x2, 0(x7)
9: or
           x4, x2, x1
           x7, x3, 9
10: subi
```

ALU Operations → 4 cycles

Memory Operations → 6 cycles



• 6-stage pipelined, no forwarding paths



1: lw x1, 0(x5)2: addi x5, x1, 1 x1, 0(x6) lw add x3, x8, x5 x2, x6, 1 5: subi x4, x3, 5 6: subi x3, x2, x4 **7:** add 1w x2, 0(x7)9: or x4, x2, x1 x7, x3, 9 10: subi





- 6-stage pipelined, <u>some</u> forwarding paths:
  - $-E \rightarrow E, M2 \rightarrow E$
  - $-W\rightarrow D$



1: lw x1, 0(x5)
2: addi x5, x1, 1
3: lw x1, 0(x6)
4: add x3, x8, x5
5: subi x2, x6, 1
6: subi x4, x3, 5
7: add x3, x2, x4
8: lw x2, 0(x7)
9: or x4, x2, x1
10: subi x7, x3, 9





CPI = 2.1

- 6-stage pipelined, <u>all</u> forwarding paths:
  - $-E \rightarrow E$ , M1 $\rightarrow E$ , M2 $\rightarrow E$
  - $-W\rightarrow D$



```
1: lw x1, 0(x5)
2: addi x5, x1, 1
3: lw x1, 0(x6)
4: add x3, x8, x5
5: subi x2, x6, 1
6: subi x4, x3, 5
7: add x3, x2, x4
8: lw x2, 0(x7)
9: or x4, x2, x1
10: subi x7, x3, 9
```





CPI = 1.9

- Dynamically scheduled, out-of-order (OOO), unlimited RS and ROB size
- 1 ALU (latency 1) + 1 Memory Unit (latency 3)



### **ALU Reservation Station (RS)**



### **MEM Reservation Station (RS)**



### **Reordering Buffer (ROB)**



## "To Do List" for Simulating Dynamically Scheduled Processors

- At the beginning of each cycle
  - E phase—Issue ready instructions:
    - From all RSs, issue as many ready instructions as there are FUs available
  - W phase—Writeback results in order:
    - Remove top entry from ROB if completed
- At the end of each cycle
  - D phase—Load result of decoding stage:
    - To the relevant RS, including ready register values
    - To the ROB, to prepare the placeholder for the result
  - E phase—Broadcast results from all FUs:
    - To all RSs (incl. deallocation of the entry)
    - To the ROB







| 1:  | lw         | x1, | 0(x! | 5)         |
|-----|------------|-----|------|------------|
| 2:  | addi       | x5, | x1,  | 1          |
| 3:  | <b>1</b> w | x1, | 0(x6 | 5)         |
| 4:  | add        | x3, | x8,  | <b>x</b> 5 |
| 5:  | subi       | x2, | x6,  | 1          |
| 6:  | subi       | x4, | x3,  | 5          |
| 7:  | add        | x3, | x2,  | x4         |
| 8:  | <b>1</b> w | x2, | 0(x  | 7)         |
| 9:  | or         | x4, | x2,  | <b>x1</b>  |
| 10: | subi       | x7, | x3,  | 9          |
|     |            |     |      |            |



# Solution ALU RS at the end of the 6<sup>th</sup> cycle



Both instructions are waiting and have all operands ready: if the architecture had multiple ALUs, both instructions could execute on the next cycle...

# Solution MEM RS at the end of the 6<sup>th</sup> cycle



# Solution ROB at the end of the 6<sup>th</sup> cycle

